Abstract


The degree distribution is one of the most fundamental graph properties of interest for real- 
world graphs. It has been widely observed in numerous domains that graphs typically have 
a tailed or scale-free degree distribution. While the average degree is usually quite small, the 
variance is quite high and there are vertices with degrees at all scales. We focus on the problem 
of approximating the degree distribution of a large streaming graph, with small storage. We 
design an algorithm headtail, whose main novelty is a new estimator of infrequent degrees 
using truncated geometric random variables. We give a mathematical analysis of headtail and 
show that it has excellent behavior in practice. We can process streams with millions of edges 
with storage less than 1% and get extremely accurate approximations for all scales in the degree 
distribution. 

We also introduce a new notion of Relative Hausdorff distance between tailed histograms. 
Existing notions of distances between distributions are not suitable, since they ignore infrequent 
degrees in the tail. The Relative Hausdorff distance measures deviations at all scales, and is a 
more suitable distance for comparing degree distributions. By tracking this new measure, we 
are able to give strong empirical evidence of the convergence of headtail. 

1 Introduction 

Graphs are a natural abstraction for any data set with entities and relationships between them.
Popular examples include online social networks such as Facebook and Twitter; transportation 
networks; biological networks such as protein-protein interaction and metabolic networks; and 
communication networks such as the internet and telephone and email networks. Many of these 
graphs are most naturally represented by a stream of edges. Especially for social and communication 
networks, each edge has an associated timestamp, and the graph is basically an aggregate of all 
these edges over some time window. Such streams are typically quite massive; social networks like 
Facebook and Twitter can generate billions of communication links in a day [1, 2]. A publicly 
available HTTP request dataset has billions of requests [3]. The scale of these data sizes has
led to interest in small-space streaming algorithms. Such algorithms accurately compute specific
properties of the total graph, using a memory footprint that is orders of magnitude smaller in size.

*Work was done while the author was an intern at Sandia National Laboratories, Livermore.

Arguably, one of the most important properties of real-world networks is the degree distribution. 
Seminal papers in massive graph analysis studied precisely this quantity [4, 5, 6]. The study of
degree distributions is probably the birthplace of real-world network analysis. It has been found
to be relevant for graph modeling, network resilience, and algorithmics [7, 8, 9, 10, 11, 12, 13].
One of the key discoveries of network analysis is the presence of scale-free or heavy-tailed degree 
distributions. The average degree of a node is usually small, but there are nodes with degrees at 
all scales. The very notion of a scale-free network has entered the common parlance because of its 
relevance to network analysis [14]. 

1.1 Problem statement 

The input is a stream of edges $e_1, e_2, \ldots$, without any repetitions. The graph created by these
edges is denoted $G = (V, E)$. For convenience, we set $V = [n]$, though the labels may be from some
arbitrary discrete universe. We do not assume that the algorithm knows $n$ and $m$, the number of
vertices and edges respectively. Each edge is represented by a pair $(u, v)$ of vertex labels.

For vertex $v$, $d_v$ denotes its degree (the number of neighbors of $v$). We set $n(d)$ to be the
number of vertices of degree $d$, and $N(d)$ to be the number of vertices of degree at least $d$. In math,
$N(d) = \sum_{r \ge d} n(r)$. It is convenient for us to work with unnormalized raw counts, so we deal with
histograms rather than distributions. We call the sequence $\{n(d)\}$ the degree histogram (dh),
and $\{N(d)\}$ the complementary cumulative degree histogram$^1$ (ccdh). When $\{n(d)\}$ is normalized
by $n$, it is called the degree distribution. We focus on the ccdh, instead of the dh. Typically, the dh
is quite noisy in real data, and the ccdh has the added benefit of being monotonically decreasing.
(Focus on the ccdh is standard for fitting procedures [15].)
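The definitions of the dh and ccdh can be illustrated with a small exact (non-streaming) computation. The following is a minimal sketch, assuming the whole edge list fits in memory and contains no repeated edges; the function name `degree_histograms` and the toy edge list are hypothetical and only serve as a ground-truth reference against which a streaming estimate could be compared.

```python
from collections import Counter

def degree_histograms(edges):
    """Return (dh, ccdh) as dicts for a non-empty edge list without repetitions."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    dh = Counter(degree.values())        # dh[d] = n(d)
    ccdh = {}
    running = 0
    for d in range(max(dh), 0, -1):      # accumulate from the largest degree down
        running += dh.get(d, 0)
        ccdh[d] = running                # ccdh[d] = N(d) = sum over r >= d of n(r)
    return dict(dh), ccdh

# Toy graph (hypothetical): vertex 1 has degree 3, vertices 2 and 3 have degree 2,
# and vertex 4 has degree 1.
edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
dh, ccdh = degree_histograms(edges)
```

Because the ccdh is built by a running sum from the top degree downwards, it is monotonically non-increasing in $d$ by construction.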

We study the problem of approximating the ccdh of $G$ using a small-space one-pass streaming
algorithm. Such an algorithm has some limited memory, denoted $M$. It sees the edges in stream
order, and on seeing edge $e_i$, updates the memory $M$. The algorithm cannot access older edges,
and $M$ is typically orders of magnitude smaller than the size of the stream. At the end of the
stream, the algorithm reports a sequence $\{\widehat{N}(d)\}$, an approximation to the ccdh of $G$.

We make no assumption on the ordering of edges. We do not consider edge deletions or edge 
repetitions. (This is the standard model used in most work on practical streaming algorithms.) 

1.2 Challenges 

How does a small-space algorithm estimate the degree distribution at all scales? The
degree distribution involves degrees at “all” scales: many low degree vertices, some intermediate
degree vertices, and few very high degree vertices. Look at Fig. 1a for the ccdh of a router topology
network. The average degree is 20, but there are vertices with degrees up to 50,000. The count
of low degree vertices is easy to estimate, since a simple random sample of vertices gives a good
estimate. Intermediate and high degrees pose a problem. There are few such vertices but it is critical
to estimate their count accurately. There is a huge literature on estimating distribution properties
of a stream of items: frequent items, distribution moments, distinct items, etc. [17, 18, 19]. (We
discuss this in depth later.) But these only give specific properties of the distribution. None of these
methods can get frequency estimates at all scales, ranging continuously from (frequent) low degrees
to (infrequent) high degrees.

$^1$This is often called the cumulative degree distribution, but that is counter to the standard definition for probability
distributions.

[Figure 1, panels: (a) as-Skitter: n = 1.7M, m = 11M, storage = 31K; (b) com-LiveJournal: n = 4M, m = 34M, storage = 200K; (c) com-Orkut: n = 3M, m = 117M, storage = 150K]

Figure 1: The output ccdh of headtail on three different input graphs from the SNAP [16] collection.
In each case, the storage is less than 1% of the stream (and less than 5% of the number of vertices).
Observe the near identical match with the true ccdh.

How to quantitatively compare (cumulative) degree distributions? How do we actually 
assert that our algorithm is any good? One can use standard statistical distance measures like 
Kolmogorov-Smirnov. Yet these measures typically ignore the tail since it contains a negligible 
fraction of vertices. Consider the following examples. We take a clique of $n$ vertices and a clique
of $n - 1$ vertices. It is natural to say that their degree distributions are quite close, but no popular
existing measure would assert that. On the other end, consider a star with n edges, and a matching 
with n edges. The degree distribution only differs at one “point”, the vertex of degree n. Yet we 
would consider the degree distributions to be fundamentally different. Most statistical measures 
would say they are similar, since they differ at only a single outlier. 

An intuitive notion of similarity is closeness in log-log plots, but how do we quantify such a 
concept? One might try to approximate degree distributions by closed-form, but fitting procedures 
are notoriously tricky for tailed distributions and subject to much error [15]. 

1.3 Main results 

The algorithm headtail: Our main contribution is a new small-space algorithm headtail that 
estimates the ccdh of an input graph stream. The novelty is a new estimator for infrequent degree 
counts, which is combined with standard sampling to give ccdh estimates at all scales. We represent 
the sampling of headtail through certain truncated geometric random variables. An analysis of 
their behavior provides the right “correction” factors to infer the ccdh from our sampling. We 
provide a detailed mathematical analysis of headtail explaining why it accurately estimates the 
ccdh. Our analysis falls short of a complete proof, and we rely on some heuristic arguments for the 
full argument. 

Relative Hausdorff distance: We introduce a new notion of distance between ccdhs (technically,
between any two histograms) called the Relative Hausdorff (RH) distance. This distance
avoids the pitfalls of standard measures, and is able to capture the closeness at all scales. Intuitively,
a small RH-distance implies that every point in one ccdh is “close” (up to relative error)
to some point in the other ccdh. Put another way, both ccdhs agree at all scales, and agree on
outliers. While this condition is quite stringent, RH distance is flexible enough to allow for minor
errors. It gives a concrete way of quantifying the quality of headtail, and empirically establishing 
convergence of our estimate. 

Empirical behavior of headtail: We run headtail on a wide variety of public graph 
datasets. It gives excellent estimates of the ccdh in all our tests, for storage less than 1% of 
the stream. We show example outputs in Fig. 1, for three different input graphs. In each case, 
observe the near perfect match with the true ccdh, at all degrees. We compute the RH distance 
for numerous runs and demonstrate convergence of headtail’s output with increasing storage. In 
all our runs, storage around 1% of the stream is sufficient for excellent match in ccdhs (and also 
for low RH-distance). 

1.4 Related Work 

Note that we can frame our problem in terms of general histogram estimation. If one views the 
input as a stream of vertex labels, then the dh (and ccdh) is the histogram of label frequencies. 
There is much work on understanding frequencies in a discrete stream, but as we detail below, none 
of this work solves the problem of estimating the ccdh. 

Finding frequent items, aka “heavy hitters,” is a classic problem in the data stream model. 
Cormode and Hadjieleftheriou [19] compare three of the most important algorithms: the frequent
algorithm [20, 21, 22], the lossy counting algorithm [23], and the space saving algorithm [24].$^2$ For
large degrees, these approaches will give accurate results, but the error term dwarfs the degree 
at smaller scales. We demonstrate this empirically in Section 5. Much work has been done in 
approximating frequency moments [27, 17, 18, 19], but they do not give an estimate for multiple 
scales. Nor has this work been implemented in practice for large data sets. 

Rather than just finding frequent items, Korn et al. [28] attempt to estimate the entire distribution
of elements in the stream. However, in contrast to our work, their approach assumes
that the distribution comes from a parameterized family of distributions, e.g., the distribution is 
Zipfian, and then focuses on estimating the relevant parameters. This approach is only applicable 
for graphs where the degree distribution is already relatively well understood. Despite much study 
and claims, there are no conclusive closed-form formulae for real-world degree distributions. The 
classic power law fitting work of Clauset et al. [15] argues why most previous methods are not 
statistically robust, and how one needs strong independence assumptions to get rigorous results. 
Therefore, headtail makes no closed form assumption on the input stream. 

Over the last ten years, there has been a growing body of work focused on processing graphs in 
the data stream model. See [29] for a summary of recent work on graph streaming and sketching. 
This work has included problems such as the number of triangles and related quantities such as the 
transitivity coefficient [30, 31, 32], estimating the connectivity properties of a graph [33], and solving 
combinatorial problems such as computing large matchings [34, 35]. Cormode and Muthukrishnan 
considered estimating properties of the degree distribution in multigraphs but not the distribution 
itself [36].

Closest to this work is the series of graph sampling papers by Ahmed et al. [37, 38, 39, 32]. Their
work focuses on estimating many properties (as opposed to a single property) with a fixed sampling
method, and they study various sampling schemes. The results on estimating ccdhs typically use
20-30% of the stream, with weaker empirical results [37]. The recent Graph Sample and Hold
framework gives extremely strong results for triangle counting [32], but is not applied to the ccdh.
This technique is closely related to an approach for estimating frequency moments [27, 40]. Our
sampling approach is also similar, and our main contribution is in the actual estimation procedure.

$^2$Other popular algorithms such as CountSketch [25] and CountMin [26] enable frequent items to be identified
when the frequency of an item may be incremented and decremented.

2 The algorithm 

The algorithm headtail has two parts: update and estimate. The procedure update is called
for every edge in the stream, and simply updates the data structures. The procedure estimate
is called at the end of the stream to get an estimate of $\{\widehat{N}(d)\}$. In what follows, the subscript $h$
refers to “head” and $t$ to “tail”.

The algorithm headtail requires two parameters, $p_h$ and $p_t$, which are probabilities. These
decide the storage requirements of the algorithm, as explained later. For convenience, we will
assume these are global variables, and will not pass them around to each function.

We will assume the existence of a hash function hash that maps strings uniformly to [0,1]. 

Data Structures: There are two sets of vertices $S_h$ and $S_t$, and corresponding maps $ct_h : S_h \to \mathbb{N}$
and $ct_t : S_t \to \mathbb{N}$. Again, we assume these are global variables.

The procedure update: This updates the data structures for each edge in the stream. Consider
edge $(u, v)$ in the stream. If $v \in S_h$, then $ct_h(v)$ is incremented (analogously for $S_t$). Now for
the critical difference between $S_h$ and $S_t$. If $v \notin S_h$ and if hash($v$) $< p_h$, then $v$ is added to $S_h$. If
$v \notin S_t$, we insert $v$ into $S_t$ with probability $p_t$. (The entire operation above is also done for $u$.) Note
the difference: for $S_h$, we essentially flip a random coin for the vertex. For $S_t$, we flip a coin for
the edge. Intuitively, $S_h$ is maintaining a uniform random set of vertices. On the other hand, $S_t$
maintains a sample of vertices biased towards higher degree.

The procedure estimate: This procedure uses $S_h, S_t, ct_h, ct_t$ to output an estimate $\{\widehat{N}(d)\}$
for the ccdh of $G$. We set $C_h(r)$ to be the number of vertices in $S_h$ with $ct_h(\cdot)$ value of $r$ (similarly
for $C_t(r)$). One can think of this as the “observed” degree distribution. The scaling of $C_h(r)$
is straightforward: we simply consider $C_h(r)/p_h$ to be an estimate of $n(r)$. By summing these
appropriately, we get an estimate (the head estimate) of $N(r)$.

For $C_t$, we first do an additive “correction”. So we set $C_t(r) = C_t(r - \ell(r))$, where $\ell(r)$ is a
correction factor. The explanation of this factor is provided in Section 3. Then, we do a biased
scaling and consider $C_t(r)/(1 - (1 - p_t)^r)$ as an estimate of $n(r)$. Again, by taking partial sums,
we have an estimate (the tail estimate) of $N(r)$.

Observe that we have two different estimates of $N(r)$. We prove in our mathematical analysis
that the former is accurate for the head of the distribution, while the latter is appropriate for the
tail. This distinction is made by $d_{thr}$, which is chosen to ensure that the first estimate has low
variance. Hence, for all degrees less than $d_{thr}$, we use the head estimate, and for the remaining, we
use the tail estimate.

We now give a formal description of the algorithm. 

For fixed $p_t \in (0,1)$, we define $\ell(r)$ to be:

$$\ell(r) = \frac{1 - p_t - (1 - p_t)^{r+1} - r p_t (1 - p_t)^r}{p_t\,(1 - (1 - p_t)^r)}$$

Algorithm 1: headtail($p_h$, $p_t$)

1 Initialize empty sets $S_h$ and $S_t$ and empty mappings $ct_h$ and $ct_t$.

2 For each edge $e_i = (u, v)$ in the stream,

3    Call update($u$, $v$).

4 Call estimate to get output estimate for $\{\widehat{N}(d)\}$.


Algorithm 2: update($u$, $v$)

1 If $u \in S_h$, increment $ct_h(u)$.

2 If $u \notin S_h$: if hash($u$) $< p_h$, insert $u$ in $S_h$ and set $ct_h(u) = 1$.

3 If $u \in S_t$, increment $ct_t(u)$.

4 If $u \notin S_t$: with probability $p_t$, insert $u$ in $S_t$ and set $ct_t(u) = 1$.

5 (Repeat above steps for $v$.)
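The update procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the class name `HeadTailState`, the helper `unit_hash` (a SHA-256 based stand-in for the uniform hash function), and the `seed` parameter are all hypothetical. Each dictionary plays the role of both a sample set and its count map.

```python
import hashlib
import random

def unit_hash(label, seed=0):
    """Map a label deterministically to [0, 1); a stand-in for the uniform hash."""
    digest = hashlib.sha256(f"{seed}:{label}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class HeadTailState:
    def __init__(self, p_h, p_t, seed=0):
        self.p_h, self.p_t, self.seed = p_h, p_t, seed
        self.ct_h = {}                      # keys form S_h, values are ct_h
        self.ct_t = {}                      # keys form S_t, values are ct_t
        self.rng = random.Random(seed)

    def _touch(self, x):
        # Head sample: the coin depends only on the vertex label, so a vertex
        # is either sampled at its first occurrence or never.
        if x in self.ct_h:
            self.ct_h[x] += 1
        elif unit_hash(x, self.seed) < self.p_h:
            self.ct_h[x] = 1
        # Tail sample: a fresh coin at every occurrence of x while x is not in S_t.
        if x in self.ct_t:
            self.ct_t[x] += 1
        elif self.rng.random() < self.p_t:
            self.ct_t[x] = 1

    def update(self, u, v):
        self._touch(u)
        self._touch(v)

# With both probabilities set to 1, every vertex is sampled at its first
# occurrence, so both count maps hold the exact degrees.
state = HeadTailState(p_h=1.0, p_t=1.0)
for u, v in [(1, 2), (1, 3), (1, 4), (2, 3)]:
    state.update(u, v)
```

Using a hash for the head sample (rather than a stored coin per vertex) means no memory is spent on vertices that are rejected, while the decision stays consistent across all occurrences of the same label.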


Algorithm 3: estimate

1 Let $C_h(r)$ be the number of vertices in $S_h$ with count exactly $r$. (Similarly, define $C_t(r)$.)

2 For all counts $r$, set $C_t(r) = C_t(r - \ell(r))$.

3 For all counts $r$:

4    Set $g_h(r) = C_h(r)/p_h$.

5    Set $g_t(r) = C_t(r)/[1 - (1 - p_t)^r]$.

6 Set $d_{thr}$ to be the largest $d$ such that $\sum_{r \ge d} g_h(r) \ge (c(\log n)/\varepsilon^2)/p_h$ (see Section 3.1).

7 For all degrees $d$:

8    If $d < d_{thr}$, set $\widehat{N}(d) = \sum_{r \ge d} g_h(r)$.

9    If $d \ge d_{thr}$, set $\widehat{N}(d) = \sum_{r \ge d} g_t(r)$.
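A literal transcription of the estimate procedure might look like the sketch below. Several details are assumptions rather than facts from the text: the function names, the parameter `min_head_count` (the number of head-sample vertices of degree at least $d$ required for the head estimate to be used, standing in for the threshold constant), the use of Python's built-in `round` for the rounding of $\ell(r)$, and the range of $r$ that is scanned. The count maps are passed in as plain dictionaries from vertex to count.

```python
from collections import Counter

def ell(r, p_t):
    """Rounded expected loss of a degree-r vertex (rounding rule is an assumption)."""
    q = 1.0 - p_t
    num = 1.0 - p_t - q ** (r + 1) - r * p_t * q ** r
    return round(num / (p_t * (1.0 - q ** r)))

def estimate(ct_h, ct_t, p_h, p_t, min_head_count=50):
    """Return {d: estimated N(d)} from the head and tail count maps."""
    C_h = Counter(ct_h.values())
    C_t_obs = Counter(ct_t.values())
    # The loss is below 1/p_t, so corrected degrees never exceed this bound.
    max_r = max(max(C_h, default=0), max(C_t_obs, default=0) + int(1 / p_t) + 1)
    degrees = range(1, max_r + 1)
    # Step 2 as written: corrected C_t(r) is the observed count at r - ell(r).
    C_t = {r: C_t_obs.get(r - ell(r, p_t), 0) for r in degrees}
    # Steps 3-5: per-degree estimates of n(r).
    g_h = {r: C_h.get(r, 0) / p_h for r in degrees}
    g_t = {r: C_t[r] / (1.0 - (1.0 - p_t) ** r) for r in degrees}
    # Partial sums from the top give the two candidate ccdh estimates.
    head, tail = {}, {}
    sum_h = sum_t = 0.0
    for d in range(max_r, 0, -1):
        sum_h += g_h[d]
        sum_t += g_t[d]
        head[d], tail[d] = sum_h, sum_t
    # Step 6: largest d whose head sample still holds enough vertices.
    d_thr = max((d for d in degrees if head[d] * p_h >= min_head_count), default=0)
    # Steps 7-9: head estimate below the threshold, tail estimate from it on.
    return {d: (head[d] if d < d_thr else tail[d]) for d in degrees}

# Sanity check with p_h = p_t = 1: the counts are the true degrees, the
# correction vanishes, and the output is the exact ccdh.
true_degrees = {1: 3, 2: 2, 3: 2, 4: 1}
exact = estimate(true_degrees, true_degrees, p_h=1.0, p_t=1.0, min_head_count=1)
```

The sanity check only exercises the degenerate case where nothing is subsampled; it says nothing about accuracy for small $p_h$ and $p_t$, which is the subject of the analysis and experiments.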
3 Mathematical Analysis


We abstract out the behavior of the algorithm in a series of claims. We stress that all our theorems 
are independent of graph stream order, and hence estimate works for all orderings. 

Definition 1. For any positive integer $s$ and $p \in (0,1)$, the truncated geometric distribution $TG_{p,s}$
has the pdf: $\forall\, 0 \le k \le s - 1$, $\Pr[X = k] = p(1 - p)^k/[1 - (1 - p)^s]$.

Observe that as $s \to \infty$, this is a standard geometric random variable.

Lemma 1. For every $v \in [n]$, $v$ is inserted in $S_h$ independently with probability $p_h$. Conditioned
on $v \in S_h$, $ct_h(v) = d_v$.

Proof. We assume that hash is a uniform random function, so hash($v$) is uniformly distributed in
$(0,1)$. The probability that hash($v$) $< p_h$ is exactly $p_h$. Observe that if hash($v$) $< p_h$, then $v$
is inserted in $S_h$ at the very first occurrence of $v$ in the stream. Hence $ct_h(v) = d_v$ whenever
$v \in S_h$. □

Lemma 2. For every $v \in [n]$, $v$ is inserted in $S_t$ independently with probability $1 - (1 - p_t)^{d_v}$.
Conditioned on $v \in S_t$, $ct_t(v) = d_v - X$, where $X \sim TG_{p_t,d_v}$.

Proof. There are $d_v$ occurrences of $v$ in the stream. The probability of $v$ being added in the $b$th
occurrence is $p_t(1 - p_t)^{b-1}$. When this happens, $ct_t(v) = d_v - (b - 1)$. The probability that $v$ is never
added is $1 - \sum_{b=1}^{d_v} p_t(1 - p_t)^{b-1} = (1 - p_t)^{d_v}$. Conditioned on $v$ being added to $S_t$, the probability of $v$
being added in the $b$th occurrence is exactly $p_t(1 - p_t)^{b-1}/[1 - (1 - p_t)^{d_v}]$. So $b - 1$ is distributed
as $TG_{p_t,d_v}$. □

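Lemma 3. The expected value of $X \sim TG_{p,d}$ is $\frac{1 - p - (1 - p)^{d+1} - dp(1 - p)^d}{p(1 - (1 - p)^d)}$.

Proof. Using the closed form for the sum of an arithmetico-geometric series:

$$\frac{p}{1 - (1 - p)^d} \sum_{k=0}^{d-1} k(1 - p)^k = \frac{p}{1 - (1 - p)^d}\left(\frac{(1 - d)(1 - p)^d}{p} + \frac{(1 - p) - (1 - p)^d}{p^2}\right) = \frac{1 - p - (1 - p)^{d+1} - dp(1 - p)^d}{p(1 - (1 - p)^d)}$$

□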

This expression is exactly (up to rounding) $\ell(d)$. Conditioned on $v \in S_t$, $\mathbf{E}[ct_t(v)]$ is $d_v$ minus a
“loss” term, which is precisely the expression in Lemma 3. That explains the use
of $\ell(d)$ in our algorithm. We make the (admittedly wrong) assumption that every vertex of degree
$d$ in $S_t$ “loses” exactly the expected loss. In other words, we assume that $ct_t(v)$ is $\mathbf{E}[ct_t(v)]$. To infer
the number of degree $d$ vertices in $S_t$, we add back the expected loss to each vertex in $S_t$. That is
why we set $C_t(r) = C_t(r - \ell(r))$.
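The closed form in Lemma 3 can be checked numerically against a direct evaluation of the expectation under Definition 1. The sketch below is only a sanity check; the function names and the sample values of $p$ and $d$ are hypothetical.

```python
def tg_pmf(k, p, s):
    """Pr[X = k] for X ~ TG_{p,s}, valid for 0 <= k <= s - 1."""
    return p * (1 - p) ** k / (1 - (1 - p) ** s)

def expected_loss_direct(p, d):
    """E[X] for X ~ TG_{p,d}, summed term by term."""
    return sum(k * tg_pmf(k, p, d) for k in range(d))

def expected_loss_closed_form(p, d):
    """The expression of Lemma 3, before any rounding."""
    q = 1 - p
    return (1 - p - q ** (d + 1) - d * p * q ** d) / (p * (1 - q ** d))

p, d = 0.01, 500                      # illustrative values only
direct = expected_loss_direct(p, d)
closed = expected_loss_closed_form(p, d)
total_mass = sum(tg_pmf(k, p, d) for k in range(d))
```

The two values agree up to floating-point error, and the loss stays below $(1-p)/p$, the mean of the untruncated geometric variable.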

It is fairly easy to bound the space and running time of headtail. 

Theorem 4. The expected space used by headtail is $O(p_h n + p_t m)$. The expected running time
of update is $O(1)$, and the expected running time of estimate is $O(p_h n + p_t m)$.

Proof. We will store all sets as hash tables, to ensure $O(1)$ updates. By Lemma 1, each vertex
is added to $S_h$ with probability $p_h$. Hence, the expected size of $S_h$ is $O(p_h n)$. For each edge in
the stream, we potentially add a vertex to $S_t$ with probability $p_t$. Hence, the expected size of
$S_t$ is $O(p_t m)$. (This is a gross upper bound, and a refined bound based on Lemma 2 would be
$\sum_v [1 - (1 - p_t)^{d_v}]$.)

The processing of update only requires insertions into sets and count increments, and requires $O(1)$
time. The procedure estimate runs in time linear in the sizes of the sets $S_h$ and $S_t$. □
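The space bounds can be made concrete for a given degree sequence. The following sketch compares the coarse bound on the size of $S_t$ with the refined sum based on Lemma 2; the function name and the example degree sequence are hypothetical, and the coarse bound is written as $2 p_t m$ because each edge triggers at most two coin flips (one per endpoint).

```python
def expected_storage(degrees, p_h, p_t):
    """Return (E|S_h|, refined E|S_t|, coarse bound on E|S_t|) for a degree sequence."""
    n = len(degrees)
    two_m = sum(degrees)                  # every edge contributes to two degrees
    head = p_h * n
    tail = sum(1 - (1 - p_t) ** d for d in degrees)
    coarse_tail = p_t * two_m
    return head, tail, coarse_tail

# Illustrative heavy-tailed sequence: many degree-1 vertices, some of degree 10,
# and one hub of degree 1000.
degrees = [1] * 1000 + [10] * 100 + [1000]
head, tail, coarse_tail = expected_storage(degrees, p_h=0.01, p_t=0.01)
```

The refined value never exceeds the coarse one, since $1 - (1 - p_t)^d \le p_t d$ for every $d \ge 1$; the gap comes from high-degree vertices, which can be inserted only once no matter how many coins are flipped for them.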

3.1 The estimators 

For the analysis of our estimators, we need to introduce various error parameters. Naturally, the
actual implementation of estimate simply sets these to be fixed constants, so we make slight modifications
and assumptions for convenience of analysis.

Let $\varepsilon \in (0,1)$ be an error parameter, and let $c$ be a sufficiently large constant.

• We set $d_{thr}$ to be the largest $d$ such that $\sum_{r \ge d} g_h(r) \ge (c(\log n)/\varepsilon^2)/p_h$. (In the implementation,
we hardcoded $c/\varepsilon^2$ to be 50.)

• We assume that $p_t$ is chosen so that $d_{thr} \ge \log(1/\varepsilon)/p_t$.

We begin with the analysis of the head estimator, which is a straightforward Chernoff bound 
application. 

Lemma 5. For all $d < d_{thr}$, $\mathbf{E}[\widehat{N}(d)] = N(d)$. With probability $> 1 - 1/n$, for all $d < d_{thr}$,
$|\widehat{N}(d) - N(d)| \le \varepsilon N(d)$.

Proof. Fix some $d < d_{thr}$. Note that the head estimator is used for $\widehat{N}(d)$. Also, $\sum_{r \ge d} C_h(r) = p_h \sum_{r \ge d} g_h(r)$ is
precisely the number of vertices of degree at least $d$ in $S_h$. For convenience, denote this by $X_d$ and
observe that it is monotonically decreasing in $d$. By Lemma 1, each vertex is added independently
to $S_h$ with probability $p_h$. Thus, $\mathbf{E}[X_d] = p_h \cdot N(d)$. Note that $\widehat{N}(d)$ is precisely $X_d/p_h$, so
$\mathbf{E}[\widehat{N}(d)] = N(d)$.

Since $d_{thr}$ is itself a random variable, we need a little care to prove the lemma. Observe that
$X_d$ is well-defined for all $d$, and is the sum of Bernoulli random variables. By a multiplicative
Chernoff bound (refer to Theorem 1.1 in [41]), $\Pr[|X_d - \mathbf{E}[X_d]| > \varepsilon \mathbf{E}[X_d]] < 2\exp(-\varepsilon^2 \mathbf{E}[X_d]/3)$.
Furthermore, by an alternate bound, if $B > e\mathbf{E}[X]$, then $\Pr[X > B] < 2^{-B}$.

When $\mathbf{E}[X_d] = p_h \cdot N(d) \ge c(\log n)/3\varepsilon^2$, apply the first bound. When $\mathbf{E}[X_d] < c(\log n)/3\varepsilon^2$,
apply the second bound with $B = c(\log n)/\varepsilon^2$. Finally, we apply the union bound over all errors,
which a calculation shows to be $< 1/n$. Hence, for any $d$ where $\mathbf{E}[X_d] < c(\log n)/3\varepsilon^2$, $X_d <
c(\log n)/\varepsilon^2$. So, $d_{thr}$ must be smaller than any such degree. Thus, for all $d < d_{thr}$, $\mathbf{E}[X_d] \ge
c(\log n)/3\varepsilon^2$, and the first Chernoff bound gives the desired concentration. □

The more challenging part is to analyze the tail estimator. We fall short of giving a complete 
proof that it works. Nonetheless, we provide some mathematical evidence of its correctness. We 
provide a high level explanation of the math that follows. We warn the reader that we shall switch 
between estimates for N{d) and n{d). 

The weakness of the head estimator is made clear in the proof of the previous lemma. The
Chernoff bound says that the error probability of estimating $N(d)$ is roughly $\exp(-p_h \cdot N(d))$.
This goes to 1 as $N(d)$ becomes smaller than $1/p_h$. That is precisely what happens in the tail of the
degree distribution, which contains fewer vertices of higher degree. In general, mild fluctuations in
estimates for low degree vertices are acceptable (there are many of them), but even a little wagging in the
tail estimates creates significant error.

But high degree vertices are more likely to be in $S_t$ by Lemma 2. Let $S_t(d)$ denote the subset of
degree $d$ vertices in $S_t$. We show in Lemma 6 how to get an estimate of $n(d)$ from $|S_t(d)|$, where the
error probabilities are roughly $\exp(-p_t \cdot d \cdot n(d))$. Note the extra $d$ factor. As long as $d \cdot n(d) > 1/p_t$,
we can hope for concentration. In other words, even though high degree vertices are infrequent, it
is provably possible to get accurate estimates for these counts.

Unfortunately, it is not clear how to estimate $|S_t(d)|$, since $ct_t(v)$ is quite different from
$d_v$. As mentioned earlier, we make the (admittedly erroneous) assumption that $ct_t(v) = d_v -
\mathbf{E}[X]$ for $X \sim TG_{p_t,d_v}$, based on Lemma 2 and Lemma 3. This is used to predict the actual degree of
$v \in S_t$, based on $ct_t(v)$. While this assumption is wrong because the truncated geometric distribution
has large variance, in practice, it works quite well.

In estimate, the proxy for $|S_t(d)|$ is given by $C_t(d)$. We show that the “ccdh” (or partial sums)
of $C_t(d)$ approximates those of $|S_t(d)|$. In other words, we can get a rough approximation for
the number of vertices of degree at least $d$ in $S_t$. This is what is proven in Theorem 7 and the
subsequent calculations.

We now proceed with the formal proofs. The following lemma provides an appropriate concentration
bound for estimating $n(d)$ from $|S_t(d)|$.


Lemma 6. For all $d$, $\mathbf{E}[|S_t(d)|] = (1 - (1 - p_t)^d)\,n(d)$. For all $d \ge d_{thr}$ and sufficiently small $p_t$:
with probability at least $1 - 2\exp(-\varepsilon p_t \cdot d \cdot n(d)/16)$, $\big||S_t(d)| - \mathbf{E}[|S_t(d)|]\big| \le \varepsilon\,\mathbf{E}[|S_t(d)|]$.


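Proof. Every degree $d$ vertex is added to $S_t$ with probability $1 - (1 - p_t)^d$ (for convenience, denote
this by $\alpha$). Linearity of expectation proves that $\mathbf{E}[|S_t(d)|] = \alpha n(d)$. Note that $|S_t(d)|$ is the
sum of Bernoulli random variables, each with expectation $\alpha$. By the original Chernoff-Hoeffding
bound [42], $\Pr[|S_t(d)| < (1 - \varepsilon)\alpha n(d)] < \exp(-D(\alpha(1 - \varepsilon)\,\|\,\alpha)\,n(d))$, where $D(\cdot\,\|\,\cdot)$ denotes the
KL-divergence. With some manipulations,

$$D(\alpha(1 - \varepsilon)\,\|\,\alpha) = \alpha(1 - \varepsilon) \ln\frac{\alpha(1 - \varepsilon)}{\alpha} + (1 - \alpha(1 - \varepsilon)) \ln\frac{1 - \alpha(1 - \varepsilon)}{1 - \alpha} \ge \alpha \ln(1 - \varepsilon) + \alpha\varepsilon \ln(1 + \alpha\varepsilon/(1 - \alpha))$$

Now we use $d \ge d_{thr} \ge \log(1/\varepsilon)/p_t$. A calculation yields $[1 - (1 - p_t)^d]\varepsilon/(1 - p_t)^d \ge 1/2$ for
sufficiently small $p_t$. Hence, the expression above is bounded below by:

$$-2\varepsilon + \alpha\varepsilon \ln(\alpha\varepsilon/(1 - \alpha))/4 \ge -2\varepsilon + \alpha\varepsilon \ln(\alpha\varepsilon)/4 - \alpha\varepsilon \ln(1 - p_t)^d/4 \ge -4\varepsilon + \varepsilon p_t d/8 \ge \varepsilon p_t d/16$$

An analogous bound holds for the upper tail, and a union bound completes the proof. □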

Hence, we would like to estimate $|S_t(d)|$ and divide by $1 - (1 - p_t)^d$ to get estimates for $n(d)$
(where $d$ is large). Our estimate for $|S_t(d)|$ is $C_t(d)$, and this scaling is precisely what is done in
estimate.


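Definition 2. • The cdf of $TG_{p,s}$, formally $G_{p,s}(k) = \Pr_{X \sim TG_{p,s}}[X \le k] = [1 - (1 - p)^{k+1}]/[1 - (1 - p)^s]$.

• $\mathrm{red}(d) = d - \ell(d)$.

Indeed, we will show that the “ccdh” of $|S_t(d)|$ is somewhat approximated by that of $C_t(d)$.

Theorem 7. $\mathbf{E}[\sum_{r \ge d} C_t(r)] = \sum_{r \ge \mathrm{red}(d)} G_{p_t,r}(r - \mathrm{red}(d))\,\mathbf{E}[|S_t(r)|]$.

Proof. Note that $\ell(d)$ is monotonically increasing in $d$. Any $v \in S_t$ such that $ct_t(v) \ge \mathrm{red}(d)$ will
be counted as part of $C_t(r)$, for some $r \ge d$. The quantity $\mathrm{loss}(v) = d_v - ct_t(v)$, conditioned on
$v \in S_t$, is distributed as $TG_{p_t,d_v}$. The probability of the loss being at most $d_v - \mathrm{red}(d)$ is exactly
$G_{p_t,d_v}(d_v - \mathrm{red}(d))$.

$$\mathbf{E}\Big[\sum_{r \ge d} C_t(r)\Big] = \sum_v \Pr[v \in S_t]\,\Pr[\mathrm{loss}(v) \le d_v - \mathrm{red}(d) \mid v \in S_t]$$

$$= \sum_{v : d_v \ge \mathrm{red}(d)} [1 - (1 - p_t)^{d_v}]\,G_{p_t,d_v}(d_v - \mathrm{red}(d))$$

$$= \sum_{r \ge \mathrm{red}(d)} G_{p_t,r}(r - \mathrm{red}(d))\,[1 - (1 - p_t)^r]\,n(r)$$

By Lemma 6, $[1 - (1 - p_t)^r]\,n(r) = \mathbf{E}[|S_t(r)|]$. □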


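3.2 Making sense of Theorem 7

Fix $p_t$ and $d$. Consider $G_{p_t,r}(r - \mathrm{red}(d))$ as a function of $r$, and suppose it had value 0 for $r < d$,
and value 1 for $r \ge d$. Think of this as the ideal value for this function. Then, by Theorem 7,
$\mathbf{E}[\sum_{r \ge d} C_t(r)] = \mathbf{E}[\sum_{r \ge d} |S_t(r)|]$, which would be exactly what we want. We prove that the “coefficients”
$G_{p_t,r}(r - \mathrm{red}(d))$ behave like a step function with a transition roughly at $d$. So $\mathbf{E}[\sum_{r \ge d} C_t(r)]$
is a sort of smoothed version of $\mathbf{E}[\sum_{r \ge d} |S_t(r)|]$.

We begin with some approximations for $\ell(d)$. It is useful to think of the limit as $p_t \to 0$ and
reparametrize as $d = k/p_t$. By Lemma 3,

$$\ell(d) = \frac{1 - p_t - (1 - p_t)^{k/p_t + 1} - k(1 - p_t)^{k/p_t}}{p_t(1 - (1 - p_t)^{k/p_t})} \approx \frac{1 - p_t - e^{-k} - ke^{-k}}{p_t(1 - e^{-k})} \approx \frac{1}{p_t} - \frac{ke^{-k}}{p_t(1 - e^{-k})} \approx \frac{1}{p_t} - \frac{ke^{-k}}{p_t}$$

Thus, $\mathrm{red}(d) \approx k/p_t - 1/p_t + ke^{-k}/p_t$. Now consider some $r = x/p_t \ge \mathrm{red}(d)$.

$$G_{p_t,r}(r - \mathrm{red}(d)) = \frac{1 - (1 - p_t)^{r - \mathrm{red}(d) + 1}}{1 - (1 - p_t)^{x/p_t}} \approx \frac{1 - \exp(-(x - k + 1 - ke^{-k}))}{1 - \exp(-x)} \qquad (1)$$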

Clearly, as $x$ becomes large, this expression goes to 1. The minimum possible value of $x$ is
$k - 1 + ke^{-k}$ (equivalently, $r = \mathrm{red}(d)$), for which the expression is 0. It behaves roughly like a step
function, with a transition point (roughly) at $k - 1 + ke^{-k}$. As $k$ becomes large, the transition point
is $k - 1$, close to $k$. When $k$ is small, the extra $ke^{-k}$ additive term ensures the transition is closer
to $k$. Of course, as $k$ becomes smaller, the function looks less like a sharp transition function. This
is shown in Fig. 2. We plot $G_{p_t,r}$ according to (1) for $k = 5, 10, 100$. The red vertical line is $x = k$
(so $r = k/p_t$), and we draw dashed vertical lines corresponding to values 0.1 and 0.9. The width
between the dashed lines is a rough measure of the error in approximation. Observe how it is fairly
close to a step function for $k = 10$, and is a coarser approximation for $k = 5$.

[Figure 2, panels: (a) k = 5; (b) k = 10; (c) k = 100]

Figure 2: Plots of $G_{p_t,r}$ according to (1) for different values of $k$. Note that $r$ is set to $x/p_t$. In each
plot, the thin vertical line is $x = k$, and the dashed and dotted lines correspond to values of 0.01
and 0.09, respectively.

Hence, for small $k$, $\mathbf{E}[\sum_{r \ge d} C_t(r)]$ is much further from $\mathbf{E}[\sum_{r \ge d} |S_t(r)|]$, and estimate provides worse
results. But we set $d_{thr} \ge \log(1/\varepsilon)/p_t$. So for degrees close to $1/p_t$, we do not use the tail estimator.
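The step-function behavior of expression (1) can be explored numerically. The sketch below evaluates the limiting form on the right-hand side of (1) and measures how wide the transition is, using the 0.1 and 0.9 levels mentioned in the text. The function names, the scan step, and the restriction to $k \ge 1$ are assumptions made for illustration.

```python
import math

def step_coefficient(x, k):
    """Limiting form of (1) for k >= 1; zero at and below the smallest admissible x."""
    x_min = k - 1 + k * math.exp(-k)
    if x <= x_min:
        return 0.0
    # Note that x - k + 1 - k*exp(-k) equals x - x_min.
    return (1 - math.exp(-(x - x_min))) / (1 - math.exp(-x))

def transition_width(k, lo=0.1, hi=0.9, step=1e-3):
    """Distance in x between the first crossings of the levels lo and hi."""
    x = k - 1 + k * math.exp(-k)
    x_lo = x_hi = None
    while x_hi is None:
        value = step_coefficient(x, k)
        if x_lo is None and value >= lo:
            x_lo = x
        if value >= hi:
            x_hi = x
        x += step
    return x_hi - x_lo

# Width of the transition relative to k, for the three values plotted in Fig. 2.
relative_widths = {k: transition_width(k) / k for k in (5, 10, 100)}
```

In this sketch the absolute width in $x$ stays roughly constant across $k$, so the width relative to $k$ shrinks as $k$ grows, which matches the observation that the coefficients look more like a sharp step for larger $k$.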

4 The Relative Hausdorff distance 

One of the main challenges in experimentally validating the behavior of estimate is in defining a
distance between ccdhs. As we hinted earlier, existing statistical distances do not capture “similarity”
of ccdhs. Motivated by concerns (detailed below), we define a new notion of distance between
ccdhs (technically, between cumulative complementary histograms). This is inspired by the geometric
notion of Hausdorff distance between subsets of a metric space. We say a ccdh is non-trivial
if it contains some non-zero point.

Definition 3. Let $F$ and $G$ be non-trivial ccdhs. Fix non-negative numbers $\varepsilon, \delta$. The distributions
$F$ and $G$ are $(\varepsilon, \delta)$-close by Relative Hausdorff (RH) distance if:

$$\forall d,\ \exists d' \in [(1 - \varepsilon)d, (1 + \varepsilon)d] \text{ such that } |F(d) - G(d')| \le \delta F(d).$$

(An analogous condition holds with $F$ and $G$ switched.)

The RH-distance between $F$ and $G$ (denoted $RH(F, G)$) is $\inf\{\varepsilon \mid F \text{ and } G \text{ are } (\varepsilon, \varepsilon)\text{-close}\}$.

Note that the RH-distance can be greater than 1. For $\varepsilon' \ge \varepsilon$ and $\delta' \ge \delta$, if $F$ and $G$ are $(\varepsilon, \delta)$-close,
they are also $(\varepsilon', \delta')$-close. Since $F$ and $G$ are non-trivial, we can set $\varepsilon$ to be large enough so
that for some $\delta$, $F$ and $G$ are $(\varepsilon, \delta)$-close. Thus, the RH distance always exists. If $RH(F, G) = 0$,
then $F$ and $G$ are identical.
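Definition 3 can be turned into a brute-force computation for small histograms. The sketch below is one possible reading of the definition, with assumptions that are not stated in the text: ccdhs are dictionaries from positive integer degrees to counts, $d$ ranges over the degrees where the first ccdh is positive, $d'$ ranges over positive integers, and a ccdh is treated as 0 beyond its largest degree. The function names are hypothetical, and the quadratic running time makes this suitable for illustration only.

```python
def one_sided_rh(F, G):
    """Smallest eps such that every positive point of F has an (eps, eps)-close point of G."""
    top = max(max(F), max(G)) + 1          # one step past both supports, where G is 0
    worst = 0.0
    for d, f_d in F.items():
        if f_d <= 0:
            continue
        # For each candidate d', the eps needed is the larger of the relative
        # degree error and the relative frequency error.
        best = min(
            max(abs(d - dp) / d, abs(f_d - G.get(dp, 0)) / f_d)
            for dp in range(1, top + 1)
        )
        worst = max(worst, best)
    return worst

def rh_distance(F, G):
    """Symmetrized version: the condition must hold with F and G switched too."""
    return max(one_sided_rh(F, G), one_sided_rh(G, F))

# ccdh of an n-clique versus an (n-1)-clique, for n = 10.
n = 10
clique_n = {d: n for d in range(1, n)}                   # N(d) = n for d <= n - 1
clique_n_minus_1 = {d: n - 1 for d in range(1, n - 1)}   # N(d) = n - 1 for d <= n - 2
rh_cliques = rh_distance(clique_n, clique_n_minus_1)
```

For these two cliques the sketch returns $1/(n - 1)$: the binding constraint is the point at degree $n - 1$ of the larger clique, which has to be matched to degree $n - 2$ of the smaller one.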

Observe that RH distance tolerates error both in degree and frequency, which is very important
for comparing degree distributions. The RH distance exactly captures the notion of being close
in log-scale, but is a much more stringent condition. It forces all points in $F$ to be close to some
point in $G$ (and vice versa). All “outlier” and tail behavior in $F$ must be approximated in $G$. For
RH-close ccdhs, the maximum degrees must be close, and furthermore, there must be approximate
agreement for frequencies at all scales.

To understand numerics, we think it is useful to think of an RH-distance $< 0.05$ as quite small.
Suppose $RH(N, \widehat{N}) < 0.05$ for a true ccdh $N$ and our algorithm output $\widehat{N}$. This means that
every reported point $\widehat{N}(d)$ is within 5% of some $N(d')$, where $d'$ is within 5% of $d$ (and vice versa).
Any RH distance greater than 1 is very large, since we only get closeness when $\varepsilon > 1$.

4.1 Problems with KS-statistic 

Fix two ccdhs $F$ and $G$. A standard comparison metric is the Kolmogorov-Smirnov (KS) statistic,
$KS(F, G) = \max_x |F(x) - G(x)|$, where $F$ and $G$ are normalized as distributions. (So $F(d)$ is the
fraction of vertices with degree at least $d$.)

We discuss specific problems with the KS statistic and show how RH avoids these pitfalls. (The
exact same issues also hold for normed distances, so we do not explicitly calculate these.)

Comparing cliques: Let $F$ be the ccdh of an $n$-clique and $G$ be the ccdh of an $(n-1)$-clique.
So $\forall\, 0 \le i \le n-1$, $F(i) = n$; $\forall\, 0 \le i \le n-2$, $G(i) = n-1$; and all other values are 0. The
KS-statistic is actually 1 (which is extremely large), since after normalization $G(n-1) = 0$ but $F(n-1) = 1$. This is
inconsistent with our intuitive notion that these degree distributions are similar. The RH distance
is $O(1/n)$, since it allows for error in degree and frequency.

Star vs matching: Let $F$ be the ccdh of a star with $n$ vertices, and $G$ be the ccdh of a matching
(disjoint edges) with $n$ vertices. (Assume $n$ is even.) So $F(1) = n$, $\forall\, 2 \le i \le n-1$, $F(i) = 1$, and
other values are 0. We also have $G(1) = n$, and all other values are 0. The values of $F$ that are 1
are insignificant compared to the dominant $F(1) = n$. A calculation shows $KS(F, G) = O(1/n)$,
though we should probably consider them different. On the other hand, $RH(F, G) = 1 - \Theta(1/n)$.
The “outlier” $F(n-1) = 1$ forces $\varepsilon$ to be $1 - \Theta(1/n)$, since $G(i) = 0$ for $i > 1$.
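
The two KS calculations above can be checked numerically. The sketch below assumes a ccdh is a dict mapping each degree $d \ge 1$ to its count, and that every vertex has degree at least the smallest degree present, so the largest count equals the number of vertices used for normalization. The function name is hypothetical.

```python
def ks_statistic(F, G):
    """KS statistic between two ccdhs given as dicts degree -> count."""
    # A ccdh is non-increasing in d, so its largest value is the vertex
    # count (assuming no vertex falls below the smallest degree present).
    n_f, n_g = max(F.values()), max(G.values())
    degrees = set(F) | set(G)
    return max(abs(F.get(d, 0) / n_f - G.get(d, 0) / n_g) for d in degrees)


n = 10
clique_n = {d: n for d in range(1, n)}                # n-clique: all degrees are n - 1
clique_n_minus_1 = {d: n - 1 for d in range(1, n - 1)}
star = {1: n, **{d: 1 for d in range(2, n)}}          # one centre of degree n - 1
matching = {1: n}                                     # every vertex has degree 1

print(ks_statistic(clique_n, clique_n_minus_1))  # 1.0
print(ks_statistic(star, matching))              # 0.1, i.e. 1/n
```

The near-identical cliques get the largest possible KS value, while the structurally different star and matching get a value that vanishes as $n$ grows.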

Ignoring the tail: Let $F$ be the ccdh of the as-Skitter graph, as plotted in Fig. 1a. Let $G$ be
the same ccdh up to degree 100 and zero afterwards. In other words, $G$ is identical to $F$ up to
the “tail” starting at degree 100. The fraction of vertices with degree > 100 is at most 0.01. A
calculation shows that $KS(F, G) \le 0.01$. So ignoring a large portion of the tail still yields a small
KS-distance. The RH-distance is > 0.99, since $\varepsilon$ needs to be large to handle the tail of $F$.

5 Experimental Results 

We implemented the algorithm in Python and performed experiments on a Samsung NP-QX411L 
laptop with an Intel Core i5-2450M 2.5GHz four core processor and 5.7GB of memory. To simulate 
a stream, we convert a graph to a list of edges stored in a text file, and read the file one line at 
a time. In the case that the graph is directed, we treat it as undirected by considering each edge 
as an unordered pair of vertices. Note that this may imply multi- or parallel edges, though we 
calculate degrees for the actual ccdh respecting this notion. 
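
The stream simulation and the computation of the true ccdh can be sketched as follows. This is an illustrative sketch rather than the code used in the experiments: the function names are hypothetical, each line is assumed to hold two whitespace-separated vertex identifiers, lines starting with `#` are treated as comments, and parallel edges are assumed to count toward the degree.

```python
from collections import Counter


def degrees_from_stream(lines):
    """Accumulate vertex degrees from an iterable of 'u v' edge lines."""
    deg = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue  # skip blank lines and comment headers
        u, v = parts[0], parts[1]
        # Each edge is an unordered pair, so direction is ignored.
        # Assumption: a repeated pair is counted again (parallel edge).
        deg[u] += 1
        deg[v] += 1
    return deg


def ccdh(deg):
    """Map d -> number of vertices with degree at least d, for d = 1..max degree."""
    if not deg:
        return {}
    hist = Counter(deg.values())
    out, running = {}, 0
    for d in range(max(hist), 0, -1):
        running += hist.get(d, 0)
        out[d] = running
    return out
```

Because a file object is itself an iterable of lines, something like `ccdh(degrees_from_stream(open(path)))` reads the stream one line at a time, where `path` is a placeholder for the edge-list file.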

We test the algorithm on a number of graphs from the SNAP [16] and KONECT [43] collections, 
the statistics of which are summarized in Table 1. We use the as-Skitter graph on 1.7M nodes and
11M edges as a case study.

Figure 3: RH distance as the storage of headtail increases.


We use the phrase storage of headtail to indicate the total storage $|S_h| + |S_t|$. As explained
in Theorem 1, this depends on $p_h$ and $p_t$.

5.1 Convergence of headtail 

We demonstrate how increasing the storage of headtail leads to convergence of the ccdh. We fix 
the as-Skitter graph. We increase the storage by letting $p_h$ range from 0.01 to 0.1 in increments of
0.01, and $p_t$ range from 0.01 to 0.16 in increments of 0.01. For each setting of $p_h$ and $p_t$, we perform
five independent runs of headtail. We also perform ten independent runs with $p_h = 0.005$, $p_t = 0.01$.
For each such run, we compute the RH distance between the output of headtail and the true
ccdh. The results are shown in Fig. 3. Observe how the RH distance goes to zero as the storage
increases. In particular, headtail outputs a ccdh with RH distance as small as 0.03 using 230K
space.

We do a more nuanced study of how $p_h$ and $p_t$ affect convergence. In this experiment, we fix a
value of $p_h$ and vary $p_t$ in increments of 0.02. We repeat this process for $p_h = 0.01, 0.025, 0.05, 0.075, 0.1$.
The RH distances of the runs are plotted in Figure 4. Each line in the plot corresponds to a fixed
$p_h$ value, and the RH distances are plotted against $p_t$. We point out that an RH distance of about
0.04 is achieved with head and tail probabilities as small as 0.025 and 0.03, respectively, resulting in
a total sample size of 82K or 0.7% of the edge stream. For each fixed $p_h$, increasing $p_t$ initially
decreases the RH distance, but it eventually converges to a non-zero value. This is because all the
error is coming from the head estimate. As we increase $p_h$, the convergence value goes down to
zero, as expected.

5.2 Results for various graphs 

Here we demonstrate the quality of the estimates output by headtail on a variety of graphs. Each
of the graphs is from the SNAP graph collection [16], with the exception of the youtube and
youtube-friendship graphs, which are from the KONECT [43] collection. The node and edge set
sizes of each graph are given in the second and third columns of Table 1, respectively. For each
graph we include the storage of the algorithm and the RH distance of the estimate for two example
runs. The storage is less than 1% in almost all runs, and certainly less than 2%. Observe how
the RH distance is usually less than 0.1. In our worst examples (soc-Pokec and com-Orkut), the
RH distance is less than 0.15. We stress that RH distance is a rather stringent condition, since it
requires closeness of the estimate at all degrees.

Figure 4: RH distance of the estimate output by our algorithm as $p_h$ and $p_t$ vary. Each
line in the plot corresponds to a fixed value of $p_h$, and plots the RH distance as $p_t$ varies.
A near-optimal RH value is achieved with $p_h = 0.025$ and $p_t = 0.03$, which yielded sample
sets with $|S_h| + |S_t| \approx 0.007m$.

In Fig. 1 of the introduction, we have plotted the actual ccdh and the output of headtail for
three of these graphs. Observe the near identical match in all examples.

5.3 Errors at different scales 

Here we investigate how well headtail performs at different scales. Specifically, we measure the
error of a ccdh estimate at each degree. Let $N$ be the ccdh of the as-Skitter graph, and $\hat{N}$ be
the headtail output. The RH distance is a maximum over all degrees, so we do a more detailed
analysis of the estimate errors. We fix a value for $\varepsilon$ and, for each degree $d$, compute the minimum
value $\delta$ such that $\exists d' \in [(1 - \varepsilon)d, (1 + \varepsilon)d]$ with $|N(d) - \hat{N}(d')| \le \delta N(d)$, and vice versa. In words,
we are “opening up” the definition of RH-distance and looking at the profile for every degree.
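
The per-degree profile described above can be sketched as follows. This is a minimal illustration under stated assumptions: ccdhs are dicts mapping integer degrees $d \ge 1$ to counts, missing degrees have count 0, and only integer $d'$ are considered. The function name and the small rounding tolerance are not from the paper.

```python
import math


def delta_profile(N, N_hat, eps):
    """For each degree d with N(d) > 0, return the smallest delta such that
    some integer d' in [(1 - eps) d, (1 + eps) d] satisfies
    |N(d) - N_hat(d')| <= delta * N(d)."""
    tol = 1e-9  # guards against floating-point rounding at the interval ends
    profile = {}
    for d, n in N.items():
        if n <= 0:
            continue
        lo = max(1, math.ceil((1 - eps) * d - tol))
        hi = math.floor((1 + eps) * d + tol)
        # The interval always contains d itself, so the range is non-empty.
        profile[d] = min(abs(n - N_hat.get(dp, 0)) / n for dp in range(lo, hi + 1))
    return profile
```

Calling `delta_profile(N, N_hat, 0.1)` and `delta_profile(N_hat, N, 0.1)` gives the two directions of the "vice versa" condition; the larger of the two values at each degree is the quantity to plot.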

We performed a run of headtail with $p_h = 0.01$ and $p_t = 0.0007$ for the as-Skitter graph. This
used a storage of 31K (< 0.5% of the stream). We then plot in Fig. 5 the corresponding $\delta$ values with $\varepsilon$
set to 0.1. The red ‘x’ markers denote the $\delta$-values for headtail (the other markers are explained
later). Observe how the $\delta$ values are quite small throughout, and peak at degree 100 at roughly
0.08. In this case, headtail achieves an RH-distance of about 0.1 with 31K space.

5.4 Comparing to other methods 

While there is no existing small-space algorithm that has demonstrable convergence to the ccdh,
there are numerous algorithms that capture only the tail. These are classic “heavy hitters” algorithms:
the frequent algorithm [20, 21, 22], the lossy counting algorithm [23], and the space saving
algorithm [24]. We study the performance of these methods. For convenience, we use “head estimator”
to denote the algorithm that simply takes uniform samples of vertices and uses their degrees
to estimate the full ccdh. This is basically what headtail employs for $d < d_{thr}$.
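
A head estimator of this kind can be sketched as below. The details are assumptions made for illustration, not the paper's implementation: each vertex is sampled independently with probability `p` via a keyed hash, so the decision is consistent across the stream without storing unsampled vertices, and counts are scaled by `1/p`. The names `sampled` and `head_estimate` are hypothetical.

```python
import hashlib
from collections import Counter


def sampled(v, p, key=b"head"):
    """Deterministically decide whether vertex v is in the sample (prob. p)."""
    h = hashlib.blake2b(str(v).encode(), digest_size=8, key=key).digest()
    return int.from_bytes(h, "big") < p * 2**64


def head_estimate(edges, p):
    """Estimate the ccdh from the exact degrees of a uniform vertex sample."""
    deg = Counter()
    for u, v in edges:
        for x in (u, v):
            if sampled(x, p):
                deg[x] += 1  # storage grows only with the sampled vertices
    hist = Counter(deg.values())
    est, running = {}, 0
    for d in range(max(hist, default=0), 0, -1):
        running += hist.get(d, 0)
        est[d] = running / p  # scale the sample count up to the full graph
    return est
```

With a small `p`, a high degree $d$ is rarely represented in the sample at all, which is why this estimator is only reliable for the head of the distribution.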
Graph                Nodes   Edges   Space (run 1)   Space (run 2)   RH distance (run 1)   RH distance (run 2)
youtube              1.1M    3M      21K             90K             0.1                   0.076
wiki-Talk            2.3M    5M      38K             74K             0.1                   0.055
youtube-friendship   3M      9M      80K             196K            0.067                 0.05
as-Skitter           1.7M    11M     31K             69K             0.1                   0.073
soc-Pokec            1.6M    30M     75K             212K            0.29                  0.14
com-LiveJournal      4M      34M     335K            467K            0.08                  0.058
com-Orkut            3M      117M    273K            387K            0.14                  0.13

Table 1: Performance of headtail for a number of graphs.

Figure 5: RH distances at different degrees. We plot the $\delta$-distance for $\varepsilon = 0.1$. The
red ‘x’ markers correspond to an estimate output by headtail using a storage of 31K.
The estimate is (0.1, 0.08)-far from the true ccdh. The rest of the plots correspond to
combinations of the head estimator using 17K space and the heavy hitter algorithms
using 34K space for a total of 51K space. The lossy counting estimate is (0.1, 1.5)-far
from the true ccdh, the space saving estimate is (0.1, 0.4)-far, and the frequent estimate is
(0.1, 0.33)-far from the true ccdh.

We fix the as-Skitter graph, and set the storage used by these algorithms to 35K. (Note that with
storage 31K, headtail gives an estimate with RH-distance less than 0.1.) We show the resulting
estimates of these algorithms in Fig. 6. Not surprisingly, none of these algorithms give reasonable
estimates for $N(d)$ where $d$ < 10^.
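
To see why heavy-hitter summaries say little about low degrees, consider the frequent algorithm, commonly known as the Misra-Gries summary. The sketch below assumes this is the variant meant here and applies it to the stream of edge endpoints, so that an item's frequency is its degree. The function names are hypothetical and `k` is the number of counters.

```python
def frequent(stream, k):
    """Misra-Gries summary: at most k counters over a stream of items.
    Each surviving counter is a lower bound on the item's true frequency."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # No free counter: decrement all and drop those that hit zero.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters


def endpoints(edges):
    """Turn an edge stream into the stream of its endpoints."""
    for u, v in edges:
        yield u
        yield v
```

Running `frequent(endpoints(edges), k)` retains only vertices whose degree is large relative to the stream length divided by `k`. Low-degree vertices are evicted, so the summary carries no information about the head of the ccdh.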

On the face of it, the above algorithms perform reasonably well on the tail. The head estimator
(which is quite simple) seems to work well for the head. Could we just combine these algorithms, 
and outperform headtail? We show that this is not the case. Crucially, none of these algorithms 
actually get accurate estimates even at the moderate to high degrees, despite the apparent closeness 
in the log-log plot of Fig. 6. 

We convert the existing algorithms into estimators for the full ccdh by combining them with the head estimator.
Pick (say) the algorithm frequent. We first run the head estimator with 20K space. We choose an
appropriate $d_{thr}$, and apply the head estimator for $d < d_{thr}$ and frequent for $d \ge d_{thr}$. We
pick the $d_{thr}$ that minimizes the RH distance to $\{N(d)\}$. We do the same for each of frequent, space
saving, and lossy counting. Note that we are being extra generous to the competing methods. First,
the total storage used is about 50K. Furthermore, we choose the $d_{thr}$ to minimize RH distance,
while headtail chooses it based on a fixed formula.

The RH distance we achieved was 0.3 (frequent), 0.5 (space saving), and 1.5 (lossy counting).
All of these used storage 50K. In contrast, headtail had RH distance of 0.1 with 31K storage.
We measure the errors at all scales in Fig. 5, for all these algorithms. This follows the procedure of the
previous section exactly: we set $\varepsilon = 0.1$ and plot the $\delta$ values for all the estimates.

We immediately see how the $\delta$-values (errors) for all the competing procedures are much higher
than for headtail. Indeed, for degrees around 10^, the errors of the other procedures are extremely
high, despite higher storage. We see that headtail handily beats all the procedures, at pretty
much all scales simultaneously. In Fig. 7, we plot the output ccdh for the head estimator combined
with frequent. As expected from Fig. 5, we see a fair amount of fluctuation from the true ccdh in
the intermediate to high degrees. We stress that a small fluctuation in a log-log plot is actually
a fairly large error in the RH measure.

For completeness, we increase the storage of the competing methods to get RH distance of 
around 0.1. For all the other algorithms, we require storage more than 150K to get comparable 
error to what headtail gives with 31K storage. 

5.5 Results for different stream orderings 

As stated previously, our algorithms do not assume any stream order. In this section we test 
the performance of the algorithm when provided the stream in different orderings. We use six 
different orderings in total. The first three are different random orderings. The second three are 
each edgelists (that is, all the edges adjacent to a particular node are read in sequence), but the 
orderings of the nodes are different. In one, we read the nodes of highest degree first, in another 
we read the nodes in increasing order of degree, and in the last we consider a random ordering of 
the nodes. In each experiment we let $p_h = 0.01$ and $p_t = 0.04$. The standard deviation of the RH
distances for each ordering is 0.009. Table 2 summarizes the RH distance of estimated ccdhs with
different stream orderings.
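
The six orderings can be generated along the following lines. This is a sketch under assumptions rather than the code used in the experiments: `edges` is assumed to be a list of vertex pairs, and in an edgelist ordering each edge is assumed to be emitted once, at the first of its endpoints to be visited. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_ordering(edges, seed):
    """A uniformly random permutation of the edge list."""
    out = list(edges)
    random.Random(seed).shuffle(out)
    return out


def node_orders(edges, seed=0):
    """Node orderings by decreasing degree, by increasing degree, and random."""
    deg = Counter(x for e in edges for x in e)
    nodes = list(deg)
    decreasing = sorted(nodes, key=lambda x: -deg[x])
    increasing = sorted(nodes, key=lambda x: deg[x])
    shuffled = nodes[:]
    random.Random(seed).shuffle(shuffled)
    return decreasing, increasing, shuffled


def edgelist_ordering(edges, node_order):
    """Visit nodes in node_order and emit each node's not-yet-emitted edges."""
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        adj[u].append(i)
        if v != u:
            adj[v].append(i)
    seen, out = set(), []
    for x in node_order:
        for i in adj[x]:
            if i not in seen:
                seen.add(i)
                out.append(edges[i])
    return out
```

Three calls to `random_ordering` with different seeds, plus `edgelist_ordering` applied to each of the three node orders, give six streams over the same edge set.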

Figure 6: ccdh estimates output by the frequent, lossy counting, and space saving algorithms,
each using a storage of 35K.

Figure 7: ccdh estimate output by the head estimator combined with the frequent algorithm
using a storage of 50K. The RH distance is 0.33.


Ordering                                RH distance
Random 1                                0.068
Random 2                                0.06
Random 3                                0.07
Edgelist: Decreasing order of degree    0.08
Edgelist: Increasing order of degree    0.083
Edgelist: Random                        0.061

Table 2: Performance of headtail for different stream orderings. The first three are
different random stream orderings. The second three are edgelists permuted by the nodes.
In each trial $p_h = 0.01$, $p_t = 0.04$.